LOGISTIC REGRESSION
> setwd("C:/Users/baron/Dropbox/627 Data Analysis/data/My data")
> Depr = read.csv("depression_data.csv")
# Another way - reading the data directly from the web site:
> Depr = read.csv(url("http://fs2.american.edu/~baron/627/R/depression_data.csv"))
> names(Depr)
[1] "ID"               "Gender"           "Guardian_status"  "Cohesion_score"
[5] "Depression_score" "Diagnosis"
> attach(Depr)
> fix(Depr)     # view/edit the data frame in a spreadsheet-style editor
> summary(Diagnosis)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0.0000  0.0000  0.0000  0.1572  0.0000  1.0000    2731
# A lot of missing responses are marked as NA. Omit them.
> Depr1 = na.omit(Depr)
> attach(Depr1); dim(Depr1)
[1] 458   6
# Now, fit the logistic regression model.
> fit = glm( Diagnosis ~ Gender + Guardian_status + Cohesion_score,
+            family = binomial )
> summary(fit)
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)       1.00832    0.50478   1.998  0.04577 *
GenderMale       -0.68744    0.28848  -2.383  0.01718 *
Guardian_status  -0.74835    0.28602  -2.616  0.00889 **
Cohesion_score   -0.04358    0.01046  -4.167 3.09e-05 ***
# All three variables are significant at the 5% level, especially the cohesion
# score (connection to the community).
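Since these estimates are on the logit scale, exponentiating them gives odds ratios, which are often easier to interpret. A minimal sketch, using the values printed in the summary above (in a live session one would simply write exp(coef(fit))):

```r
# Odds ratios from the logit-scale estimates in the coefficient table above.
# (Sketch: values copied from the printed summary, not recomputed from data.)
b = c(GenderMale = -0.68744, Guardian_status = -0.74835, Cohesion_score = -0.04358)
OR = exp(b)
round(OR, 3)
```

For example, each one-point increase in the cohesion score multiplies the odds of a depression diagnosis by about 0.96, holding the other variables fixed.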
# Cross-validation.
# How well does our model predict within the training data?
> Prob = fitted.values(fit)
> summary(Prob)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.01958 0.07452 0.12990 0.15720 0.21560 0.57710
# We'll classify a student as having depression if the predicted probability
# exceeds 0.3.
> YesPredict = 1*(Prob > 0.3)   # For all Prob > 0.3, we let YesPredict = 1.
                                # For all Prob <= 0.3, we let YesPredict = 0.
# Then, create a table of true and predicted responses.
> table( Diagnosis, YesPredict )
         YesPredict
Diagnosis   0   1
        0 359  27
        1  48  24
# This is not a perfect result; there are some false positive and false negative
# diagnoses. Overall, we correctly predicted (359+24)/458 = 83.6% of cases, so the
# training error rate is only 16.4%. However, among the students who are really
# depressed, we correctly diagnosed only 24/72 = 1/3.
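The counts in the table above can be turned into the usual classification metrics directly. A short sketch, with the numbers copied from the printed table:

```r
# Classification metrics recomputed from the training confusion table above.
tab = matrix( c(359, 48, 27, 24), nrow = 2,
              dimnames = list(Diagnosis = c("0","1"), YesPredict = c("0","1")) )
accuracy    = sum(diag(tab)) / sum(tab)        # (359+24)/458 = 0.836
sensitivity = tab["1","1"] / sum(tab["1", ])   # 24/(48+24)   = 1/3
specificity = tab["0","0"] / sum(tab["0", ])   # 359/(359+27) = 0.930
```

The high overall accuracy is driven by the majority class (no diagnosis); the low sensitivity is what the comment above is pointing at.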
PREDICTION ACCURACY. Training data and test data
# As we know, prediction error within the training data may be misleading since
# all responses were known and used to develop our classification rule. To get a
# fair estimate of the correct classification rate, let's
# (1) Split the data into training and test subsamples;
# (2) Develop the classification rule based on the training data;
# (3) Use it to classify the test data;
# (4) Cross-tabulate our prediction with the true classification.
> n = length(ID)
> Z = sample(n, n/2)
> Depr.training = Depr1[ Z, ]
> Depr.testing = Depr1[ -Z, ]
# Now fit the logistic model using training data only.
> fit = glm( Diagnosis ~ Gender + Guardian_status + Cohesion_score,
+            family = binomial, data = Depr.training )
# Use the obtained rule to classify the test data.
> Prob = predict( fit, data.frame(Depr.testing), type="response" )
> YesPredict = 1*( Prob > 0.3 )
# Cross-tabulate.
> attach(Depr.testing)
> table( YesPredict, Diagnosis )
          Diagnosis
YesPredict   0   1
         0 174  22
         1  23  14
# We still classify 80%+ of participants correctly. However, we correctly
# diagnose only 14/36 = 39% of students who actually have depression. Perhaps
# gender, parents, and community are not enough for correct depression
# diagnostics.
Receiver Operating Characteristic (ROC) Curve
Focus on the true positive rate and the false positive rate for different thresholds.
    True positive rate  = P( predict 1 | true 1 ) = power
    False positive rate = P( predict 1 | true 0 ) = false alarm
> TPR = rep(0,100); FPR = rep(0,100)
> # The fitted model does not depend on the threshold k, so fit it once
> # outside the loop and only vary the classification threshold inside.
> fit = glm( Diagnosis ~ Gender + Guardian_status + Cohesion_score,
+            data=Depr1[Z,], family="binomial" )
> Prob = predict( fit, data.frame(Depr1[-Z,]), type="response" )
> for (k in 1:100){
+    Yhat = ( Prob > k/100 )
+    TPR[k] = sum( Yhat==1 & Diagnosis==1 ) / sum( Diagnosis==1 )
+    FPR[k] = sum( Yhat==1 & Diagnosis==0 ) / sum( Diagnosis==0 )
+ }
> plot( FPR, TPR, xlab="False positive rate", ylab="True positive rate",
+       main="ROC curve" )
> lines(FPR, TPR)
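The ROC curve is often summarized by the area under it (AUC): 0.5 means no discrimination, 1 means perfect. A trapezoidal-rule sketch, shown here on a small hypothetical set of (FPR, TPR) points rather than the vectors produced by the loop above:

```r
# Trapezoidal approximation to the area under an ROC curve.
# The FPR/TPR values below are made up for illustration only; with the
# loop above, you would use its FPR and TPR vectors instead.
FPR = c(0.0, 0.1, 0.3, 0.6, 1.0)
TPR = c(0.0, 0.4, 0.7, 0.9, 1.0)
ord = order(FPR)                         # trapezoids require increasing FPR
auc = sum( diff(FPR[ord]) *
           ( TPR[ord][-1] + TPR[ord][-length(TPR)] ) / 2 )
auc                                      # 0.75 for these toy points
```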
Prediction
Let's predict the diagnosis for a particular person: a female who lives with both
parents and has an extremely weak connection with the community.
> predict( fit, data.frame( Gender="Female", Guardian_status=1, Cohesion_score=26 ))
-0.8730466
# This is the predicted logit. Use the logistic function to convert it into a
# probability.
> Y0 = predict( fit, data.frame( Gender="Female", Guardian_status=1, Cohesion_score=26 ))
> P0 = exp(Y0)/(1+exp(Y0))
> P0
0.2946208
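Base R also provides the logistic CDF as plogis(), so the hand computation above can be checked against it. A quick sketch, reusing the logit printed above:

```r
# exp(y)/(1+exp(y)) is exactly the logistic CDF, available as plogis(y).
y = -0.8730466                  # predicted logit from the example above
p_hand = exp(y) / (1 + exp(y))
p_base = plogis(y)
c(p_hand, p_base)               # both approximately 0.2946208
```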
# This can also be done with the type="response" option.
> predict( fit, data.frame( Gender="Female", Guardian_status=1, Cohesion_score=26 ),
+          type="response" )
0.2946208
# A 29% chance of developing depression! Suppose she has an average community
# connection instead.
> predict( fit, data.frame( Gender="Female", Guardian_status=1, Cohesion_score=52 ),
+          type="response" )
0.1185683
# Only an 11.9% chance now.
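As a sanity check (a sketch, assuming the Cohesion_score slope -0.04358 from the coefficient table earlier), the second prediction follows from the first by shifting the logit by (52 - 26) times the cohesion coefficient:

```r
# Shifting the logit by 26 * (-0.04358) reproduces the second probability.
# Slope taken from the printed coefficient table; logit from the first prediction.
y26 = -0.8730466
y52 = y26 + (52 - 26) * (-0.04358)
p52 = exp(y52) / (1 + exp(y52))
p52                            # approximately 0.1186, matching the output above
```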